ClickBench Playground #904

Open
alexey-milovidov wants to merge 227 commits into main from playground-wip

Conversation

@alexey-milovidov

No description provided.

alexey-milovidov and others added 30 commits May 12, 2026 19:59
WIP checkpoint. Lets visitors run SQL against any of the 80+ ClickBench
systems via a single-page UI, each isolated in a per-system Firecracker
microVM.

  - server/  aiohttp API: /api/systems, /api/state, /api/query,
             /api/admin/provision. Owns the per-system VM lifecycle,
             a 1-Hz CPU/disk/host-pressure watchdog, and a batched
             ClickHouse-Cloud logging sink (JSONL fallback).
  - agent/   stdlib HTTP agent that runs inside each VM and wraps the
             system's install/start/load/query scripts.
  - images/  scripts to build the base Ubuntu 22.04 rootfs + per-system
             rootfs/system-disk pair (200 GB sparse + 16/88 GB sized
             for the system's data format).
  - web/     vanilla JS SPA — system picker, query box, X-Query-Time /
             X-Output-Truncated rendering.

Smoke-tested: base rootfs boots under Firecracker, agent comes up in
~2 s, /health and /stats respond. Agent self-test on the host (no VM)
covers all 4 endpoints including 10 KB output truncation. ClickHouse
provisioning is in flight; see playground/docs/build-progress.md for
the running checkpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A later `umount -lR` on the chroot's /dev was propagating through the
shared mount group and tearing down the host's /dev/pts, breaking sshd's
PTY allocation. `--make-rslave` keeps mount events flowing *into* the
chroot but blocks unmounts from leaking back to the host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A 16 GB guest snapshot.bin compresses to ~2 GB once we
  1) stop+start the system daemon (sheds INSERT-time heap arenas,
     buffers, fresh allocator pages),
  2) echo 3 > drop_caches (turns 3-5 GB of page cache into zero
     pages),
  3) zstd -T0 -3 --long=27 (parallel, big match window — most of
     the savings come from those zero pages).

Restart is skipped for in-process engines where stop/start is a
no-op AND the data lives in the process; wiping it would defeat
the whole point.

The host now keeps snapshot.bin.zst as the canonical artifact and
decompresses on demand right before /snapshot/load. snapshot.bin
itself is deleted after a successful restore + teardown.
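
A minimal sketch of the host-side compress/decompress pair under these flags (function names are illustrative, not the PR's actual code; `--long=27` is passed on both sides so the window settings match):

  import subprocess

  def compress_snapshot(snapshot_bin: str) -> str:
      # Parallel zstd with a 128 MiB match window; the zeroed guest RAM
      # is where most of the ~8:1 savings come from.
      subprocess.run(["zstd", "-T0", "-3", "--long=27", "-f", snapshot_bin],
                     check=True)
      return snapshot_bin + ".zst"

  def decompress_snapshot(zst: str) -> str:
      # Decompress on demand right before /snapshot/load.
      subprocess.run(["zstd", "-d", "--long=27", "-f", zst], check=True)
      return zst[:-len(".zst")]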

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous version threw away stdout/stderr from the pre-snapshot
stop/start cycle, so a silent failure (`sudo clickhouse start` failing
because the data dir was still locked by the dying daemon, etc.) left
us with a snapshot of a dead clickhouse-server — restored VMs then
returned "Connection refused (localhost:9000)" on every query and the
only way to recover was to manually delete the snapshot.

Capture stdout+stderr into the provision log so the failure mode is
visible via GET /provision-log, and refuse to mark PROVISION_DONE if
./check doesn't recover within the timeout. The host then sees /provision
return 500 and skips the snapshot step entirely.
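
A sketch of the capture-and-verify flow; the helper names and log path are assumptions, and ./check is taken as a probe that exits 0 once the daemon answers:

  import subprocess, time

  LOG = "/var/lib/clickbench-agent/provision.log"  # illustrative path

  def run_logged(cmd):
      # Keep stdout+stderr so GET /provision-log can show what broke.
      p = subprocess.run(cmd, capture_output=True, text=True)
      with open(LOG, "a") as f:
          f.write("$ %s (rc=%d)\n%s%s" % (" ".join(cmd), p.returncode,
                                          p.stdout, p.stderr))

  def wait_until_healthy(timeout_s=300):
      deadline = time.monotonic() + timeout_s
      while time.monotonic() < deadline:
          if subprocess.run(["./check"]).returncode == 0:
              return
          time.sleep(2)
      # Don't mark PROVISION_DONE: /provision returns 500 and the host
      # skips the snapshot step.
      raise RuntimeError("daemon did not recover within %ds" % timeout_s)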

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PROVISION_DONE lives on the rootfs disk (/var/lib/clickbench-agent/),
which persists across VM cold-boots. So on the second provision after
the host deleted the snapshot files, the agent saw PROVISION_DONE
already set and returned "already provisioned" — but the daemon
itself wasn't running (cold boot, no clickhouse-server in systemd),
so the host snapshotted an empty VM and every restored query came back
with "Connection refused (localhost:9000)".

Two fixes:
  1. Agent: on every startup, if PROVISION_DONE is set, kick ./start
     in a background thread. start is idempotent for the systems that
     have a daemon, so it costs nothing when the daemon is already up
     (post-restore) and brings it up when the rootfs is being re-used
     across a cold reboot.
  2. Host: when (re-)provisioning a system with no snapshot, drop the
     existing rootfs.ext4 so install/start/load run fresh. The
     system.ext4 (which holds ~14 GB of pre-staged dataset) is preserved.
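
Sketch of fix 1, using the PROVISION_DONE marker path from above (the working directory is an assumption; the background thread keeps agent startup non-blocking):

  import os, subprocess, threading

  MARKER = "/var/lib/clickbench-agent/PROVISION_DONE"

  def _kick_daemon_if_provisioned():
      if not os.path.exists(MARKER):
          return
      # ./start is idempotent for daemon systems: free when the daemon
      # is already up (post-restore), brings it up on a cold reboot.
      threading.Thread(
          target=subprocess.run,
          args=(["./start"],),
          kwargs={"cwd": "/opt/clickbench/system"},
          daemon=True,
      ).start()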

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cloud image ships hostname=ubuntu but /etc/hosts only maps
'localhost' to 127.0.0.1. Every sudo invocation inside the VM then
tries to reverse-resolve 'ubuntu' against the network — which has no
DNS after the snapshot drops internet — and pays the ~2 s resolver
timeout. With several sudos per ./query, that's a multi-second floor
on every query, visible in the firecracker log as repeated
'sudo: unable to resolve host ubuntu: Name or service not known'.

Mapping ubuntu to 127.0.0.1 short-circuits the lookup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mid-snapshot checksum-mismatch I attributed to "stopping the
daemon mid-merge" was actually FS corruption: KVM pauses the vcpus
the moment we call /vm Paused, and any ext4 writeback that was in
flight at that instant gets captured by the snapshot as half-flushed.
On restore the page cache references on-disk blocks that never landed,
and the next read sees a torn write.

Fix:
  1. Drop the pre-snapshot stop/start. Killing ClickHouse at any
     point never corrupts on-disk MergeTree data — only an unflushed
     FS can.
  2. Add a /sync endpoint to the agent and call it from the host
     right before /vm Paused, so all dirty pages have hit virtio-blk
     before KVM freezes the vcpus.
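
The ordering matters: flush first, then freeze. A sketch of the host side with aiohttp (which the server already uses); the agent URL and socket path are illustrative:

  import aiohttp

  async def pause_vm(agent_base: str, fc_api_sock: str) -> None:
      # 1) Agent /sync calls os.sync() in the guest, pushing all dirty
      #    pages through virtio-blk before the vcpus stop.
      async with aiohttp.ClientSession() as s:
          async with s.post(agent_base + "/sync") as r:
              r.raise_for_status()
      # 2) Only now is it safe to freeze the vcpus via Firecracker's
      #    API socket.
      conn = aiohttp.UnixConnector(path=fc_api_sock)
      async with aiohttp.ClientSession(connector=conn) as s:
          async with s.patch("http://localhost/vm",
                             json={"state": "Paused"}) as r:
              r.raise_for_status()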

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the host /syncs the FS before pausing the vcpus, the snapshot
captures consistent on-disk state regardless of when the daemon exits
(MergeTree's on-disk format is durable under arbitrary process exit;
only an unflushed *filesystem* corrupts it). So we can shut the daemon
down here to evict its private heap (merge thread arenas, query cache,
mark cache, uncompressed cache, ingest buffers) and snapshot what's
left — mostly zero-fill RAM, which zstd compresses ~300:1.

Restore path is unchanged: _kick_daemon_if_provisioned at agent
startup brings the daemon back up on every cold restore. First query
in a restored VM pays a 1-2 s daemon-start cost instead of carrying
8-12 GB of memory in every snapshot.

In-process engines (chdb, polars, …) keep all state in RAM and have
no daemon to stop; for them, has_daemon is false and we skip the
stop step, falling back to drop_caches alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for the small-snapshot path:

  1. Pass init_on_free=1 in the guest kernel cmdline. Linux normally
     leaves freed page frames with whatever bytes were last written to
     them, so the post-`clickhouse stop` free pool was ~10 GB of stale
     daemon heap and Firecracker's snapshot dump compressed only ~3:1.
     init_on_free=1 zeros every page as it goes onto the free list, so
     the snapshot's RAM region is genuinely zero-filled and zstd hits
     ~300:1.

  2. Add `_ensure_daemon_started` at the top of the agent's /query
     handler. After a snapshot restore (taken with the daemon stopped),
     the restored memory has no daemon process and `localhost:9000`
     refuses connections. The cold-boot `_kick_daemon_if_provisioned`
     only fires on actual cold boots, not on snapshot resumes, so we
     need an explicit check at query time. Lock-protected so concurrent
     /query requests don't try to ./start the daemon twice; idempotent
     and free once the daemon is up.

Also dropped the userspace _zero_free_ram hack — init_on_free does
it natively at no userspace cost.
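
Sketch of point 2; the lock makes concurrent /query callers serialize on the first post-restore start (using ./check as the readiness probe is an assumption):

  import subprocess, threading

  _start_lock = threading.Lock()
  SYSTEM_DIR = "/opt/clickbench/system"  # illustrative

  def _ensure_daemon_started():
      with _start_lock:
          # Idempotent and nearly free once the daemon is up.
          if subprocess.run(["./check"], cwd=SYSTEM_DIR).returncode == 0:
              return
          subprocess.run(["./start"], cwd=SYSTEM_DIR, check=True)
          subprocess.run(["./check"], cwd=SYSTEM_DIR, check=True)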

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end working with a 35 MB snapshot (16 GiB raw, ~470x ratio):
SELECT COUNT(*) returns 99997497 cleanly, GROUP BY URL produces the
expected top-N without any checksum errors, output truncation caps a
244 KB result at 10 KB with the right header set.

Cold path (snapshot restore + daemon start): ~10 s.
Warm path (live VM): subsecond on COUNT / MIN-MAX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four correctness/efficiency fixes:

1. Shared read-only datasets disk. Previously each per-system rootfs
   embedded its own copy of hits.parquet / hits.tsv / hits.csv (14-75 GB
   each), so the catalog needed ~1-2 TB of redundant dataset storage on
   the host. Build one shared datasets.ext4 instead, attach to every VM
   read-only at LABEL=cbdata, and have the agent copy the bytes the
   system actually needs from /opt/clickbench/datasets into the writable
   per-system disk at provision time only. The agent uses
   os.copy_file_range so the in-VM copy is kernel-side, not bounced
   through userspace (see the sketch after this list).

2. Golden-disk snapshot/restore. Firecracker's snapshot.bin only saves
   memory; the disk image referenced by the in-memory state is the
   live file. If anything modifies it between snapshots (background
   merges, log writes, /tmp churn) the next /snapshot/load points at
   the new disk while replaying old memory references. We were getting
   away with this because clickhouse-server happens to be tolerant,
   but it's fragile. Now /snapshot also renames the working disks into
   `*.golden.ext4`, and /restore-snapshot clones the goldens back into
   fresh working copies via `cp --sparse=always`. Every restore starts
   from the exact disk state captured at snapshot time.

3. Bound per-system disk builds and provisions via asyncio.Semaphore
   (PLAYGROUND_BUILD_CONCURRENCY=6, PLAYGROUND_PROVISION_CONCURRENCY=32)
   so kicking off 98 systems at once doesn't thrash the host NVMe or
   rate-limit Ubuntu mirrors.

4. Re-enabled `ursa` in the playground catalog (was incorrectly in the
   _EXTERNAL exclude list; it runs locally).
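
The in-VM copy from item 1, sketched with os.copy_file_range (Python 3.8+, Linux); the helper name is made up:

  import os

  def copy_kernel_side(src: str, dst: str) -> None:
      # Bytes move inside the kernel; nothing is bounced through a
      # userspace read/write buffer.
      with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
          remaining = os.stat(src).st_size
          while remaining > 0:
              n = os.copy_file_range(fsrc.fileno(), fdst.fileno(), remaining)
              if n == 0:
                  break
              remaining -= n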

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous design copied dataset files from the read-only cbdata
mount into the per-VM writable cbsystem disk on every provision —
14 GB for parquet systems, 75 GB for tsv/csv. That worked but was
redundant: the data is already on a read-only mount, the only reason
we copied was that ClickBench's load scripts do `sudo mv` and
`sudo chown` on the dataset files.

Use overlayfs instead:
  lowerdir = /opt/clickbench/datasets_ro   (RO, the shared image)
  upperdir = /opt/clickbench/system_upper  (RW per-VM disk with scripts)
  merged at /opt/clickbench/system

The system's load runs at cwd=/opt/clickbench/system. It sees scripts
+ dataset files in one tree. When it `mv`s or `chown`s a file from
the lower, overlayfs does a lazy copy-up: only the file's bytes get
materialised into the upper, and only when the script actually
mutates it. Most ClickBench load scripts `rm` the dataset file after
INSERT, which becomes a whiteout in the upper — a few bytes of
metadata, not a 75 GB copy.

Saves ~1-2 TB across the catalog on host disk (no per-system copies)
*and* eliminates the per-provision in-VM stage. Only cost: small
metadata to maintain the overlay (kilobytes).

For partitioned parquet, the source files live in
datasets_ro/hits_partitioned/ but the load globs cwd/hits_*.parquet,
so the agent creates symlinks in the upper pointing at the lower —
~100 symlinks, a few hundred bytes total.

Also: make build-datasets-image.sh idempotent. The 173 GB rsync
into datasets.ext4 only needs to run when the source dir's mtime
has changed; otherwise the cached image is reused.
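
Roughly what the agent's overlay setup amounts to, as a sketch (workdir must sit on the same filesystem as upperdir; the workdir path is an assumption, the rest follows the layout above):

  import subprocess

  def mount_system_overlay():
      # mv/chown from the lower triggers lazy copy-up; rm becomes a
      # whiteout entry in the upper instead of a 75 GB copy.
      subprocess.run([
          "mount", "-t", "overlay", "overlay", "-o",
          "lowerdir=/opt/clickbench/datasets_ro,"
          "upperdir=/opt/clickbench/system_upper,"
          "workdir=/opt/clickbench/system_work",
          "/opt/clickbench/system",
      ], check=True)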

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for the parallel-provisioning-98-systems path:

1. The _build_sem and _provision_sem fields were defined but never
   acquired — `provision-all.sh` kicked all 98 provisions at once and
   they each independently spawned build-system-rootfs.sh, which
   tried to write ~8 GB of rootfs base content × 98 in parallel
   (~780 GB of writes against a single NVMe). Disk got saturated and
   nothing finished. Use `async with self._build_sem:` and `async
   with self._provision_sem:` around the heavy phases.

2. build-system-rootfs.sh now clones the base image at block level
   with `cp --sparse=always` and resizes the filesystem to 200 GB
   in place, instead of mkfs.ext4 + mount + rsync-of-base-contents.
   The block-level clone touches only the ~2 GB of non-zero blocks
   in the base, vs. the rsync approach traversing the mounted base
   and writing every file individually. Per-system rootfs build
   goes from ~30 s to ~3 s.
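
Fix 1 in sketch form — the semaphores existed; what was missing was the `async with` around the heavy phases (the stub coroutines stand in for those):

  import asyncio, os

  _build_sem = asyncio.Semaphore(
      int(os.environ.get("PLAYGROUND_BUILD_CONCURRENCY", "6")))
  _provision_sem = asyncio.Semaphore(
      int(os.environ.get("PLAYGROUND_PROVISION_CONCURRENCY", "32")))

  async def _build_rootfs(system: str): ...   # heavy: writes GBs to NVMe
  async def _provision(system: str): ...      # heavy: boots VM, runs load

  async def provision_system(system: str):
      async with _build_sem:        # at most 6 rootfs builds in flight
          await _build_rootfs(system)
      async with _provision_sem:    # at most 32 concurrent provisions
          await _provision(system)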

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the agent created symlinks in the overlay's upper for
partitioned parquet (hits_partitioned/* -> upper/hits_*.parquet)
because the source directory was nested. That fell apart on
clickhouse's load: `mv hits_*.parquet /var/lib/clickhouse/user_files/`
moved the symlinks, and the subsequent `chown` followed them through
to the read-only datasets disk and got `Read-only file system`.

Flatten the dataset image so all 100 partitioned parquet files sit
at the root next to hits.parquet / hits.tsv / hits.csv. The overlay
then exposes them directly at /opt/clickbench/system as real files,
no symlinks involved. clickhouse's `mv` becomes a real copy-up (and
the source becomes a whiteout in upper), and the subsequent `chown`
operates on a regular file on the rootfs — works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2 GB cap on the per-VM system disk was a holdover from the
in-VM-copy era, when system.ext4 only held scripts + staged data.
Once we switched to overlay-with-RO-datasets, system.ext4 also holds
the overlay's upperdir + workdir — i.e. every byte the load script
writes lands there, including the database's own files. ClickHouse
writes ~5 GB of MergeTree parts, DuckDB ~6 GB, Hyper ~10 GB; chown
on partitioned parquet copies up another 14 GB. 2 GB was always
going to overflow.

Match the rootfs at 200 GB (apparent). The file is sparse: truncate
reserves the size but allocates no physical blocks, mkfs.ext4 writes
~50 MB of metadata, and the snapshot/restore path uses
`cp --sparse=always` so only the bytes the VM actually wrote land
on the host disk. Light systems (chdb, sqlite, ...) cost the host
near nothing; heavy ones (tidb at ~137 GB, postgres-indexed ~80 GB)
fit without hitting ENOSPC mid-load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each per-system rootfs build was running `e2fsck -fy` on its clone
before `resize2fs`. With 98 systems and ~5 s per fsck of a 200 GB
sparse file, that's ~8 minutes of pure disk thrash during catalog
build — and entirely redundant: the base ext4 is built fresh and
never mounted dirty, so the bit-for-bit clone is clean too.

Move the single fsck to the end of build-base-rootfs.sh (where it
has all the host's I/O to itself) and skip it in the per-system
loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base ext4 used to be built at 8 GB and each per-system rootfs
clone ran resize2fs to grow to 200 GB. resize2fs on a 200 GB file is
disk-heavy (it has to write group descriptor and bitmap metadata for
every additional block group), and we did it 98 times in parallel.

Build the base directly at 200 GB sparse with
lazy_itable_init=1,lazy_journal_init=1. mkfs writes ~50 MB of
superblock + GDT material upfront and defers the rest to lazy
background init, so the image file's physical footprint is unchanged
from the previous 8 GB layout (~1.8 GB). Per-system clones then need
only `cp --sparse=always`: no resize2fs, no e2fsck, ~1 second each.
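
The two steps as a sketch (mkfs options as named above; filenames illustrative):

  import subprocess

  def build_base(path="base-rootfs.ext4"):
      # 200 GB apparent, ~0 physical: truncate allocates no blocks.
      subprocess.run(["truncate", "-s", "200G", path], check=True)
      # Defer inode-table/journal init; mkfs writes only ~50 MB upfront.
      # -F: don't prompt about formatting a regular file.
      subprocess.run(["mkfs.ext4", "-F", "-E",
                      "lazy_itable_init=1,lazy_journal_init=1", path],
                     check=True)

  def clone_for_system(base, dst):
      # Touches only the ~1.8 GB of non-zero blocks; ~1 s per clone.
      subprocess.run(["cp", "--sparse=always", base, dst], check=True)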

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`umount` already syncs the filesystem being unmounted. The
host-wide `sync` we were calling first flushes every dirty page on
*every* mount — under 98-way parallel builds, each build's sync
blocked on every other build's writeback, multiplying the wall-clock
cost. Drop those host-wide sync calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olden

When clickhouse's load runs `mv hits.parquet /var/lib/clickhouse/user_files/`
(or any other cross-FS move), copying the 14-75 GB dataset into the
writable per-VM disk and then `rm`'ing it after INSERT, ext4 marks those blocks
free but the underlying virtio-blk file still carries the bytes.
`cp --sparse=always` on the golden then preserves them as random
data, so the per-system snapshot for a parquet engine carried a full
extra copy of the dataset that the load already discarded.

Adding `fstrim /opt/clickbench/sysdisk` and `fstrim /` before the
host's snapshot makes the guest issue DISCARD for free blocks; the
host loop driver responds by punching holes in the sparse backing
file (linux loop devices advertise discard with PUNCH_HOLE since 4.x,
which firecracker's virtio-blk passes through). The golden then holds
only the bytes the engine actually keeps.
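
The guest-side step is tiny; a sketch using the mountpoints named above:

  import subprocess

  def trim_before_snapshot():
      # DISCARDs propagate: guest ext4 -> virtio-blk -> host loop
      # device, which punches holes in the sparse backing file.
      for mnt in ("/opt/clickbench/sysdisk", "/"):
          subprocess.run(["fstrim", "-v", mnt], check=True)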

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts do `sudo mv hits_*.parquet
/var/lib/<engine>/user_files/` or `sudo cp hits.csv .../extern/`
followed by `chown` to the daemon's user. The mv/cp copies 14-75 GB
of data the daemon reads once during INSERT and we delete right
after — a complete waste of bytes on disk and time on the wire.

Replace with `ln -s` + `chown -h` where the daemon's user-files dir
is on a different filesystem from the dataset. `chown -h` chowns
the symlink itself rather than following into the (often read-only)
original; the underlying dataset is mode 644 anyway, so daemon
processes can read through the symlink as their own user.

Systems updated: clickhouse, clickhouse-tencent, pg_clickhouse,
kinetica, oxla, ursa, arc, cockroachdb.

Motivated by the ClickBench playground (Firecracker microVM service)
where the dataset is mounted read-only and shared across all VMs;
the copy step was the dominant cost on parquet/csv-format systems
and pulled 14 GB into the per-VM snapshot golden disk unnecessarily.
The change is also benign for the regular benchmark — daemons still
read the same bytes, just through a symlink.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8080 is the default HTTP admin port for cockroach, the spark UI,
trino, presto, druid, and a long tail of other JVM-based databases
in the catalog. Our in-VM agent was binding it first, so when their
./start ran, the daemon failed with "bind: address already in use"
and the whole provision came down with a port conflict.

Pick 50080 — uncommon enough that no ClickBench engine in the
current catalog wants it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts call ../lib/download-hits-* — e.g.
doris-parquet expects `download-hits-parquet-partitioned <doris_be_dir>`
to materialize the dataset in a specific subdirectory of the BE's
working tree. Previously we copied the lib tree into /opt/clickbench/
system/_lib, but ../lib from the system dir resolves to
/opt/clickbench/lib, not /opt/clickbench/system/_lib.

Put 4 stub scripts (one per format) at /opt/clickbench/lib in the
base rootfs. Each one symlinks from the shared RO dataset mount into
the target directory — same interface as upstream's wget-based
scripts, but instant and zero-byte-on-disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The firecracker-ci kernel is minimal: it boots fine, but Docker
fails to start because it lacks iptables/nat, br_netfilter, veth and
other modules that Docker needs to set up its bridge network. That
killed ~6 Docker-using systems (byconity, cedardb, citus, cloudberry,
greenplum) in the parallel provisioning run.

Swap in Ubuntu's `linux-image-generic` kernel (the same one Ubuntu
ships for cloud KVM guests). It has every Docker-required module
plus a much richer driver set, while still booting under Firecracker.
Trade-off: it lacks CONFIG_IP_PNP so the kernel's `ip=` boot arg is
ignored. Add a tiny clickbench-net.service that parses `ip=` from
/proc/cmdline and applies it to eth0 at boot; agent.service waits
for it. The same rootfs continues to work with the firecracker-ci
kernel (the systemd unit's `ip addr add` is idempotent — kernel-set
IPs are already there).

Verified: smoke-boot agent answered in 3 s on the new kernel.
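
Approximately what clickbench-net.service does, as a Python sketch (the real unit may well be shell; the `ip=` field layout is the kernel's client:server:gw:netmask:hostname:device:autoconf):

  import subprocess

  def apply_ip_from_cmdline():
      args = open("/proc/cmdline").read().split()
      spec = next((a[3:] for a in args if a.startswith("ip=")), None)
      if not spec:
          return
      client, _server, gw, netmask = spec.split(":")[:4]
      prefix = sum(bin(int(o)).count("1") for o in netmask.split("."))
      # Harmless if the firecracker-ci kernel already set the address:
      # the add just reports 'File exists' and we move on.
      subprocess.run(["ip", "addr", "add", "%s/%d" % (client, prefix),
                      "dev", "eth0"])
      subprocess.run(["ip", "link", "set", "eth0", "up"], check=True)
      if gw:
          subprocess.run(["ip", "route", "add", "default", "via", gw])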

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Ubuntu generic kernel builds overlay, veth, br_netfilter,
iptable_nat, nf_conntrack and friends as loadable modules, not
built-in. Without /lib/modules/<ver>/ in the rootfs the kernel can't
load them at runtime — the immediate symptom was `Failed to mount
/opt/clickbench/system` (overlayfs not available) and Docker still
failing to start (no br_netfilter/iptable_nat).

Drop the linux-modules-7.0.0-15-generic deb into the chroot,
`dpkg --unpack` it into the rootfs, run `depmod`, and pre-load the
critical modules via /etc/modules-load.d/clickbench.conf so they're
ready before any service starts. The image grew from 1.8 to 2.0 GB
physical (200 GB apparent) — modules add ~200 MB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dpkg --unpack` records the modules package in dpkg's status DB
without configuring it; subsequent `apt-get install` calls inside
every per-system VM see an unconfigured package with unmet
dependencies and bail with "Unmet dependencies. Try 'apt --fix-broken
install'". That broke ~10 systems in the previous parallel run.

Switch to `dpkg-deb -x` — extracts the data tarball into the rootfs
without touching dpkg's DB. apt sees a normal system with all modules
in /lib/modules/, and the kernel can load them at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the state after the 10th parallel run. Documents:
  - what works end-to-end (microVM lifecycle, shared RO datasets disk,
    per-restore disk hygiene, fstrim before snapshot, Ubuntu kernel
    with modules)
  - bug fixes pushed during the run (port 8080 conflict, mv→ln -s,
    download-hits stubs, build/provision semaphores, redundant fsck/
    resize2fs/sync removed, clickbench-net.service, kernel module
    preload, 200 GB system disk for heavy systems)
  - failure categories observed
  - what's left for the long tail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent failures observed in the 10th parallel run:

1. The 7 pg_* systems (pg_clickhouse, pg_duckdb*, pg_ducklake,
   pg_mooncake) all failed to spawn firecracker with
   `Firecracker panicked at main.rs:296: Invalid instance ID:
   InvalidChar('_')`. Firecracker's --id rejects underscores. Map
   `_` to `-` for the fc id (the system name itself stays intact).

2. duckdb / chdb-dataframe / duckdb-dataframe OOM-killed at 16 GB
   ("Out of memory: Killed process 578 (duckdb) anon-rss:15926176kB").
   DuckDB and chdb hold the full dataset in memory during INSERT;
   16 GB just isn't enough for the 100 M row hits set. Bump default
   VM memory to 32 GB. KVM allocates lazily, so 98×32 GB on the host
   is fine.

3. monetdb's install fails with `$USER: unbound variable`. systemd's
   default service env has no USER/LOGNAME. Stamp them as root in
   clickbench-agent.service so subprocess.run inherits them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickBench: fix elasticsearch load.py bytes/str mix

VM tweaks for the long tail of failures:
  - chdb-dataframe / duckdb-dataframe materialize the full hits dataset
    in process memory and need >32 GB. Default to 48 GB.
  - Druid / Pinot / similar JVM stacks take 5-10 min to come up
    (Zookeeper → Coordinator → Broker → Historical, in sequence). The
    agent's 300 s check-loop wasn't enough; widen to 900 s.

elasticsearch/load.py: gzip.open in mode='rt' returns str docs, but
bulk_stream yields bytes for ACTION_META_BYTES and str for the doc.
requests.adapters.send() calls sock.sendall() on the mixed iterable
and crashes with `TypeError: a bytes-like object is required, not
'str'`. Open in 'rb' so docs are bytes — matches the rest of the
generator.
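
The shape of the fix (names follow the commit's description of load.py; the exact action-line bytes are assumed):

  import gzip

  ACTION_META_BYTES = b'{"index":{}}\n'  # per the commit; value assumed

  def bulk_stream(path):
      # 'rb' keeps docs as bytes, matching the action lines, so
      # sock.sendall() sees a homogeneous bytes iterable.
      with gzip.open(path, "rb") as f:
          for doc in f:
              yield ACTION_META_BYTES
              yield doc  # bytes, newline-terminated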

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chdb-dataframe, duckdb-dataframe, polars-dataframe, daft-parquet,
daft-parquet-partitioned load the whole hits dataset into a single
in-process DataFrame. Observed peak RSS is 80-100 GB on the
partitioned parquet set — even though KVM allocates lazily,
sustaining that working set for shared use isn't feasible. Disable
them in the registry rather than bump RAM for everyone.

Revert the default per-VM RAM cap to 16 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
duckdb-memory's load OOM'd at 16 GB anon-rss — it's the same RAM-
resident model as duckdb-dataframe/chdb-dataframe, just packaged as
its own ClickBench entry. Add to the disabled-systems list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 30 commits May 15, 2026 17:05
The Firecracker CI kernel (vmlinux-6.1.141) does not include
CONFIG_NF_TABLES — every nft call inside the VM returns
'Failed to initialize nft: Protocol not supported'. Ubuntu 24.04
defaults `update-alternatives --display iptables` to the nft
variant, and dockerd's bridge-driver startup calls
`iptables -t nat -N DOCKER`. The nft failure aborts dockerd →
docker.service exits 1/FAILURE → every docker-based system
fails at install time with
  Cannot connect to the Docker daemon at unix:///var/run/docker.sock
The legacy backend uses ip_tables / iptable_nat / xt_* modules
which the firecracker kernel does compile in (and the modules-load.d
hook here pre-loads).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts_partitioned/

build-datasets-image.sh rsyncs /opt/clickbench-playground/datasets/
verbatim, so the partitioned parquet files end up at
/opt/clickbench/datasets_ro/hits_partitioned/hits_N.parquet inside
the VM. The lib stub was linking from
/opt/clickbench/datasets_ro/hits_N.parquet (no subdir) — every
symlink dangled and every partitioned-parquet load script failed
with 'No files found that match the pattern \"hits_*.parquet\"'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After fixing the iptables-nft → legacy default, dockerd installs and
starts cleanly. `docker run` then fails with:
  iptables v1.8.10 (legacy): can't initialize iptables table 'raw':
  Table does not exist (do you need to insmod?)
because modprobe doesn't auto-load every iptables filter table on
demand inside a stripped-down firecracker rootfs. dockerd's
DIRECT ACCESS FILTERING uses the `raw` table; we already pre-load
`iptable_nat`, so add `iptable_raw`, `iptable_filter`,
`iptable_mangle`, and `xt_conntrack` to the list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…p layout

These three were added to the playground before being rewritten in
the PR #860 split-benchmark.sh refactor — they still carried the
old monolithic benchmark.sh + run.sh. Replace benchmark.sh with the
thin shim that sources lib/benchmark-common.sh, drop run.sh, and
add a data-size script measuring fb-volume (the bind-mounted
firebolt-core data directory).

install/start/check/load/query/stop already existed from when we
wrote them per-step originally; this only catches the metadata
files up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dockerd 28+ added "DIRECT ACCESS FILTERING": iptables -t raw DROP
rules to block traffic going directly to container IPs. The
Firecracker CI kernel doesn't compile in CONFIG_IP_NF_RAW, so
'iptables -t raw -A PREROUTING' fails with 'Table does not exist'
and 'docker run' on the default bridge exits 125.

Write /etc/docker/daemon.json setting the bridge driver's
gateway_mode_ipv4/ipv6 = nat-unprotected. Container traffic still
masquerades via the `nat` and `filter` tables (which the kernel
does have); we lose the extra "host-bypass DROP" layer that's
fine to skip in a sandboxed single-container microVM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The gateway_mode_ipv4=nat-unprotected attempt didn't take effect for
the auto-created default `bridge` network on docker.io 29.x — every
docker run still tries to insert a `raw`-table DROP rule and fails
with 'Table does not exist'. Set iptables=false in daemon.json:
dockerd stops touching iptables altogether, port forwarding goes
through the userland docker-proxy (which works fine for our
single-container-per-VM use case), and the host-side
net.enable_filtered_internet path still handles VM→upstream
masquerade.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…heck timeout

The 10-min wait-for-ready loop just printed
  firebolt-core did not become healthy in 10 min
with zero context, so subsequent re-kicks were blind. Add
docker ps / inspect / logs / ss listener / curl probe on the
failure path so the provision log carries enough to triage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The diagnostic dump showed firebolt-core refusing to start with:
  The directory '/firebolt-core/volume/' (owner 0:0, permissions
  755) is not readable or writeable by the Firebolt Core process
  (running as effective user 1111, effective group 1111).
The agent provisions as root, so the bind-mounted host dir lands
as root:root; firebolt-core inside the container is uid 1111 and
won't initialize the engine. chown the host-side dir to 1111:1111
before docker run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… setup

Reflink + transparent zstd are both native on btrfs, so the
two-phase reflink-then-zstd snapshot dance is no longer needed:
revert _snapshot_disks/_restore_disks to plain reflink and let the
filesystem handle compression. Update install-firecracker.sh to
document mkfs.btrfs + compress=zstd:1 as the recommended host
setup; XFS still works for reflink but lacks compression and fills
the host at ~7 TB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Upstream ClickBench keeps the 100 hits_N.parquet partitioned files
under hits_partitioned/; load scripts glob `hits_*.parquet` at cwd,
not from a subdir. The agent relies on overlay magic for staging
(lowerdir=datasets_ro, cwd=/opt/clickbench/system), and that
surfaces files at the root of the dataset image but leaves
hits_partitioned/ as a subdir — the glob then matches nothing.

Symptom: clickhouse / pg_clickhouse / ursa / daft-parquet-partitioned
/ duckdb-parquet-partitioned / duckdb-vortex-partitioned all
hit 'No files found that match the pattern "hits_*.parquet"' (or
the dialect-specific equivalent) at load time.

Materialise the per-file symlinks in cwd in the agent rather than
in each system's load script so the 6+ partitioned consumers don't
each reimplement the same staging step (which historically rotted
when one or two were updated and the rest weren't — upstream
centralised this in lib/download-hits-* for the same reason).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ClickHouse v26.x canonicalises the filesystem-cache path before
the policy check that 'absolute path must lie inside
/var/lib/clickhouse/caches/'; an older trick of pointing
caches/web at /dev/shm via symlink is now rejected with
BAD_ARGUMENTS at CREATE TABLE time.

Bind-mount /dev/shm/clickhouse onto /var/lib/clickhouse/caches/web
so the kernel-canonicalised path stays inside caches/ but the
underlying bytes still live in tmpfs (the whole point — cold
queries pull ~1 GB into the cache and we don't want that on the
host SSD).

Also clean up a leftover symlink from previous install runs
before the mkdir/mount so re-running install is idempotent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First-boot initdb inside the cedardb container runs through
'Fixing permissions on existing directory' and 'Setting up
database directory' phases that take 90-120 s on cold disk
before postgres actually listens. The 60 s budget bailed during
that window, leaving the system in start-failed and never
snapshotted.

pg_isready exits fast once the daemon is up, so the longer
timeout only changes behaviour in the failure path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tch)

Provisioning always starts on a fresh per-VM rootfs, so the prior
symlink-cleanup + mountpoint guard added nothing and just made
the script noisier.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v2.5.12 has an Arrow type-inference bug under static-schema mode:
incoming JSON integers are inferred as Float64 even when the row
fits in Int64, and every /ingest with an Int64-declared field
fails with 400 "Fail to merge schema field 'X' because the from
data_type = Float64 does not equal Int64". The load script's
parallel ingest loop hit this on the very first chunk and logged
~5000 'curl: (22) HTTP 400' lines while loading zero rows; queries
then returned 0 for everything.

Verified the fix locally: v2.7.2 accepts the bundled
static_schema.json and the playground's hits.json shape — single
row ingest returns 200, COUNT(*) and AVG(UserID) both produce the
expected values.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-all)

The host's FORWARD policy is ACCEPT (Docker would flip it but we
disable Docker's iptables management in the VM rootfs, and we don't
want to flip the global policy ourselves — it would break unrelated
host forwarding). disable_internet was only stripping the per-slot
ACCEPTs and the POSTROUTING MASQUERADE, leaving every other packet
to fall through to the default ACCEPT.

Practical exploit: a VM with arbitrary code execution exposed to
the benchmark consumer (pandas, polars, dataframe variants) could
curl 169.254.169.254/latest/api/token and get a real IMDSv2 token
— the AWS hypervisor responds to the VM's RFC1918 source address
even without our MASQUERADE rule, and the reply gets forwarded
back the same way through the still-ACCEPT default policy. From
there an attacker can read the EC2 instance role's credentials.
Datalake systems are accidentally safe (the PREROUTING REDIRECT
to the SNI proxy catches TCP/80 before FORWARD, and the proxy's
Host-header allowlist rejects 169.254.169.254) but every other
system was wide open.

Refactor: introduce _strip_slot(slot) that parses `iptables -S`
output and removes every rule mentioning the slot's TAP or CIDR.
Each enable/disable function calls it first, then installs its
own rules — no more order-dependent interaction where a stale
catch-all DROP from one mode silently blocks the next mode's
ACCEPT. disable_filtered_internet is no longer needed (subsumed
by _strip_slot) and goes away.

disable_internet now installs explicit `-i tap -j DROP` and
`-o tap -j DROP` so isolation no longer relies on the chain's
default policy.
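
A sketch of _strip_slot: replay `iptables -S` with -A flipped to -D for every rule naming the slot (naive tokenization is fine here — the playground's rules carry no quoted arguments):

  import subprocess

  def _strip_slot(tap: str, cidr: str) -> None:
      for table in ("filter", "nat"):
          listing = subprocess.run(["iptables", "-t", table, "-S"],
                                   capture_output=True, text=True,
                                   check=True).stdout
          for rule in listing.splitlines():
              if rule.startswith("-A ") and (tap in rule or cidr in rule):
                  subprocess.run(["iptables", "-t", table, "-D"]
                                 + rule.split()[1:], check=True)

  def disable_internet(tap: str, cidr: str) -> None:
      _strip_slot(tap, cidr)
      # Explicit DROPs: no reliance on the chain's default policy.
      for flag in ("-i", "-o"):
          subprocess.run(["iptables", "-A", "FORWARD", flag, tap,
                          "-j", "DROP"], check=True)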

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The line `sys.stderr.write(f"{o[\"elapsed\"]}\n")` parses fine on
older Pythons that lex f-string contents textually but breaks on
Python 3.12+ where PEP 701 parses the brace contents as a real
expression — and a backslash inside a Python expression (outside a
string literal) is invalid, so every query failed with
"unexpected character after line continuation character" before
even reaching the server.

Drop the f-string for plain str() concatenation; no quote-nesting,
no version-dependent lexer quirk.
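
Before/after for clarity:

  import sys

  o = {"elapsed": 0.042}

  # Before — Python 3.12+ parses the braces as a real expression,
  # where a backslash is invalid:
  #   sys.stderr.write(f"{o[\"elapsed\"]}\n")

  # After — no quote nesting, no lexer-version dependence:
  sys.stderr.write(str(o["elapsed"]) + "\n")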

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_build_images_if_needed short-circuits when both rootfs.ext4 and
system.ext4 already exist, on the assumption that re-cloning costs
disk for no benefit. That's wrong whenever base-rootfs.ext4 has
been rebuilt since: the in-VM agent and the lib/download-* stubs
live in the base, and the per-system scripts live in the sysdisk
upper — and both stay stale.

Concrete bite: today's agent change to stage partitioned parquet
symlinks at cwd shipped in base-rootfs.ext4 at 18:05, but every
already-provisioned partitioned system that we re-kicked
afterwards (datafusion-partitioned and friends) booted off the
pre-fix rootfs.ext4 from 15:39, ran the OLD agent that doesn't
stage anything, and the load script's `mv hits_*.parquet
partitioned/` matched zero files — leaving the parquet
external-table empty and every query failing with
'No field named "EventDate"' / 'table hits not found'.

Fix: compare mtimes; if base is newer, drop both the rootfs and
the sysdisk so build-system-rootfs.sh runs and re-rsyncs both.
On btrfs `cp --sparse=always` is a reflink — re-cloning a 200 GB
sparse rootfs is near-instant, so the conservative invalidation
isn't expensive.
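
The mtime comparison as a sketch (paths and helper name illustrative):

  import os

  def images_stale(base: str, rootfs: str, sysdisk: str) -> bool:
      # Conservative: if the base was rebuilt after either clone, drop
      # both. Re-cloning is a reflink on btrfs, so near-instant.
      if not (os.path.exists(rootfs) and os.path.exists(sysdisk)):
          return True
      return os.stat(base).st_mtime > min(os.stat(rootfs).st_mtime,
                                          os.stat(sysdisk).st_mtime)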

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The README and architecture doc were conceptual; nothing walked
through "from a blank Ubuntu 24.04 box to a serving playground".
INSTALL.md does, in order: format btrfs + zstd, clone repo, set
up sudoers, install firecracker/kernel/DNS/(optional)TLS, download
datasets, build datasets image, build base rootfs, configure
ClickHouse Cloud logging, start the server, provision the catalog.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cross-system tooling (playground sweep, agent /query) keys off
queries.sql by filename even when the contents aren't SQL. siglens
ships SPL/Splunk QL but the file extension was producing
NO_QUERIES misses in every catalog-wide sweep. Renaming aligns
with every other system in the repo; the contents are unchanged,
and benchmark.sh already declared BENCH_QUERIES_FILE accordingly
(now matches reality, the override line is unnecessary but harmless).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two playground-agent behaviours used to be controlled by separate
mechanisms — an opaque .preserve-state file in the system dir for
"skip the pre-snapshot stop+start cycle" and nothing at all for
"force ./stop after snapshot restore". Both are now driven by
per-system variables in benchmark.sh, the same surface that
already exposes BENCH_DOWNLOAD_SCRIPT / BENCH_DURABLE /
BENCH_QUERIES_FILE.

  PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT=yes
    The loaded state lives only in the daemon's process memory
    (pandas / polars / duckdb-dataframe / daft-parquet / chdb-
    dataframe / polars-dataframe — and pinot / tidb which have
    slow JVM/cluster bring-up worth snapshotting hot). Stopping
    pre-snapshot would wipe the in-process DataFrame and the
    restored snapshot would serve queries against a daemon whose
    `hits = None`. Replaces the .preserve-state marker file.

  PLAYGROUND_RESTART_AFTER_RESTORE_SNAPSHOT=yes
    After a firecracker memory snapshot+restore the cluster's
    internal connections (brpc, gossip) are stale; the system's
    ./start does a shallow health probe ("SELECT 1" against the
    local node) and short-circuits, leaving the broken cross-node
    connections in place — every subsequent query then fails with
    "Connection refused" / "no available searcher nodes in the
    cluster". byconity and quickwit both showed this; opting them
    in causes the agent to force ./stop on btime shift before the
    next ./start so the bring-up is from a clean state.

Agent reads the vars by grep, NOT by sourcing benchmark.sh (which
ends with `exec ../lib/benchmark-common.sh`). Both vars live next
to BENCH_DURABLE in the per-system shim, so the contract stays in
one file.
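
A sketch of the grep-style reader (regex details assumed; the point is that benchmark.sh is scanned textually, never sourced):

  import re

  def _bench_var(benchmark_sh: str, name: str):
      pat = re.compile(r'^\s*(?:export\s+)?%s=["\']?([^"\'\n]*)'
                       % re.escape(name))
      with open(benchmark_sh) as f:
          for line in f:
              m = pat.match(line)
              if m:
                  return m.group(1)
      return None

  skip = _bench_var("benchmark.sh",
                    "PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT") == "yes"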

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fix)

When we set `iptables: false` in /etc/docker/daemon.json (to work
around the missing kernel CONFIG_IP_NF_RAW on the firecracker
guest kernel — Docker 28+'s DIRECT ACCESS FILTERING insists on
the raw table), dockerd stopped installing its usual nat-table
rule:

  -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

Container-originated packets then leave the VM with their docker0
source intact (172.17.0.x). The host's per-slot MASQUERADE
matches only the VM TAP CIDR (10.200.X.0/24), so the 172.17.0.x
packet exits ens1 unchanged and AWS drops it. Empirically:
presto-datalake's load failed with `Name or service not known`
for clickhouse-public-datasets.s3.eu-central-1.amazonaws.com, and
cloudberry's install failed inside a Rocky Linux container with
`Could not resolve host: mirrors.rockylinux.org`.

Replicate the missing rule via a small systemd unit that runs
after docker.service. The nat table is intact (it's `raw` that
isn't compiled in), so MASQUERADE works fine.

Also:
- cedardb / cedardb-parquet: bump start-ready timeout 300s → 600s
  (the container's initdb takes longer than 5 min on the cold
  sysdisk; this was the proximate cause of two HEALTHCHECK-TIMEOUT
  failures in the last sweep).
- trino-datalake / trino-datalake-partitioned: set
  BENCH_CHECK_TIMEOUT=1800. Trino's cold JVM bootstrap pushes past
  the lib's 300 s default, then keeps going for several more
  minutes; both variants timed out at the 900 s ./check budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cedardb base variant got bumped to 600s in the last commit but
cedardb-parquet still had the older 60s, so it would have hit the
same HEALTHCHECK-TIMEOUT failure mode again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent waited up to a hardcoded 900 s for ./check to succeed
after ./start, regardless of what the per-system benchmark.sh
declared. trino-datalake / trino-datalake-partitioned bumped
BENCH_CHECK_TIMEOUT=1800 to cover Trino's cold-JVM bootstrap, but
the agent ignored it and bailed at 900 s — exactly the
"check did not succeed within 900s" we saw.

Read the override via the same _bench_var() grep that handles
PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT etc., and clamp to a
floor of 900 s so the existing baseline still covers Druid /
Pinot / similar JVM stacks that don't declare an override.
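
With the _bench_var sketch from earlier, the clamp is one expression:

  def check_timeout_s(benchmark_sh: str) -> int:
      declared = _bench_var(benchmark_sh, "BENCH_CHECK_TIMEOUT")
      # Floor of 900 s keeps the Druid/Pinot-class baseline intact.
      return max(900, int(declared)) if declared else 900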

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The upstream cap is sized for a bare-metal benchmark machine.
Playground VMs have 16 GiB RAM total, so a 27 GB RAM tier
overshoots physical memory; kinetica's rank-1 worker gets
OOM-killed mid-LOAD and the load fails with
`[GPUdb]executeSql: Internal_Error: Rank 1 non-responsive
(Table:"ki_home.hits")`. Keeping 7 GiB of headroom for the
agent, dockerd, and the rest of the kinetica plane keeps the
load on the disk tier and the load completes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every OOM during ./load just printed
  psql:create.sql:109: ERROR:  unable to allocate memory
and we couldn't tell whether the agent's mkswap+swapon actually
ran, whether the container saw the swap, or whether the sysctl
tweaks (overcommit_memory=1, max_map_count, swappiness) stuck.
With umbra in NEEDS_SWAP and a 256 GiB swap.raw attached, OOM
shouldn't be possible — but it is, so dump enough state at the
end of ./start that the next failure tells us where to look.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Trino's cold-start in the datalake configuration (hive catalog,
S3 credentials shim) ran past the 1800s budget on the last
provision. trino (non-datalake) and trino-partitioned snapshot
fine on the same 900s default, so the slowdown is specific to
the catalog/S3 config — give the cold path another 30 min and
revisit with diagnostics if it still doesn't land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Umbra's COPY consistently ENOMEMs ~9 min into the load on the
default 16 GiB VM, even with NEEDS_SWAP (256 GiB swap.raw active,
overcommit_memory=1, no docker cgroup memory cap). The
diagnostic dump confirmed swap is mounted and the container's
memory.max / memory.swap.max are 'max', so the kernel isn't the
one refusing — umbra's own allocator hits a wall at the working-
set peak before the swap path can catch up.

Add VM_MEM_OVERRIDES_MIB in systems.py and have vm_manager pull
mem_size_mib from it (falling back to the host's vm_mem_mib).
Bump umbra to 32 GiB; the COPY then finishes, the snapshot carries
the warm working set, and restored queries don't pay reload cost.
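
The override table in sketch form (names from the commit; the fallback mechanics are assumed):

  # systems.py (sketch)
  VM_MEM_OVERRIDES_MIB = {
      "umbra": 32 * 1024,  # COPY's peak outruns the swap path at 16 GiB
  }

  def mem_size_mib(system: str, default_mib: int) -> int:
      # vm_manager falls back to the host's vm_mem_mib when unlisted.
      return VM_MEM_OVERRIDES_MIB.get(system, default_mib)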

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Diagnostic dump from the last failure showed swap mounted, swap
unused, container memory.max/swap.max both 'max'. The remaining
hypothesis for the ENOMEM is umbra calling mlock() on a chunk
bigger than the 8 MiB RLIMIT_MEMLOCK we explicitly set — mlock
returns ENOMEM independent of how much swap is available, since
locked pages by definition can't be paged out.

- Switch the docker --ulimit from memlock=8388608 to memlock=-1
  (unlimited).
- Also dump vm.overcommit_memory / .swappiness / .max_map_count
  and the container's effective `ulimit -l` so the next failure
  conclusively tells us whether the sysctl tweaks stuck and what
  the container actually sees.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntainer

`./load` does `ln -f hits_*.parquet data/hits/` to populate the
hive `external_location`. With the agent now staging partitioned
parquet as symlinks at cwd pointing to
/opt/clickbench/datasets_ro/hits_partitioned/hits_N.parquet, GNU
ln's default behavior (`-P`) creates a hardlink to the SYMLINK
inode rather than dereferencing — so `data/hits/hits_N.parquet`
is a hardlink to a symlink whose target is an absolute host-VM
path the container can't see. Inside the trino/presto container
the symlinks all dangle, the hive external_location appears
empty, and queries return 0 rows.

Add `-v /opt/clickbench/datasets_ro:/opt/clickbench/datasets_ro:ro`
to both containers so the absolute symlink targets resolve from
inside the container too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>